1 Introduction and overview

Regular expressions have been studied for almost fifty years, yet many interesting and
challenging problems about them remain unsolved. By a regular expression, we mean a
string over the alphabet Σ ∪ {+, *, (, ), ε, ∅} that represents a regular language. For
example, (0 + 10)*(1 + ε) represents the language of all strings over {0, 1} that do not
contain two consecutive 1's.

We would like to enumerate both (i) valid regular expressions and (ii) the distinct 
languages they represent. Observe that these are two different enumeration tasks: on the 
one hand, every regular expression represents exactly one regular language. On the other 
hand, simple examples, such as the expressions (a + b)* and (b*a*)*, show that there is
no one-to-one correspondence between regular languages and regular expressions. 

We are in a similar situation if we use descriptors other than regular expressions, such 
as deterministic or nondeterministic finite automata. Although enumeration of automata 
has a long history, until recently little attention was paid to enumerating the distinct
languages accepted. Instead, authors concentrated on enumerating the automata themselves
according to various criteria (e.g., acyclic, nonisomorphic, strongly connected, initially 
connected, ...). 

Here is a brief survey of known results on automata. Vyssotsky [50] raised the question
of enumerating strongly connected finite automata in an obscure technical report
(but we have not been able to obtain a copy). Harary [16] enumerated the number of
"functional digraphs" (which are essentially unary deterministic automata with no
distinguished initial or final states) according to their cycle structure; also see Read [45] and
1571 . Harary also mentioned the problem of enumerating deterministic finite automata
over a binary alphabet as an open problem in a 1960 survey of open problems in
enumeration [17, pp. 75, 87], and later in a similar 1964 survey [18]. Ginsburg [13, p. 18] asked
for the number of nonisomorphic automata with output on n states with given input and
output alphabet size.

Harrison [20, 21] developed exact formulas for the number of automata with specified
size of the input alphabet, output alphabet, and number of states. Similar results were
found by Korshunov [27]. However, in their model, the automata do not have a
distinguished initial state or set of final states. Using the same model, Radke [43] enumerated
the number of strongly connected automata, but his solution was very complicated and
not particularly useful. Harary and Palmer [19] found very complicated formulas in the
same model, but including an initial state and any number of final states.

Harrison [20, 21] gave asymptotic estimates for the number of automata in his model,
but his formulas contained some errors that were later corrected by Korshunov [28].
For example, the number of nonisomorphic unary automata with n states (and no
distinguished initial or final states) is asymptotically $c(\pi n)^{-1/2}\tau^{-n}$, where $c \approx 0.80$ and
$\tau \approx 0.34$.

Much work on enumeration of automata was done in the former Soviet Union. For
example, Liskovets [35] studied the number of initially connected automata and gave
both a recurrence formula and an asymptotic formula for them; also see Robinson [46].
Korshunov [29] counted the number of minimal automata, and [30] gave asymptotic
estimates for the number of initially connected automata. The 78-page survey by Korshunov
BTI, which unfortunately seems never to have been translated into English, gives these
and many other results. More recently, Bassino and Nicaud [2] found that the number of
nonisomorphic initially connected deterministic automata with n states is closely related
to the Stirling numbers of the second kind.

Shallit and Breitbart observed that the number of finite automata can be applied to
give bounds on the "automaticity" of languages and functions [48]. Pomerance, Robson,
and Shallit [42] gave an upper bound on the number of distinct unary languages accepted
by unary NFA's with n states. Domaratzki, Kisman, and Shallit considered the number of
distinct languages accepted by finite automata with n states [9]. They showed, for
example, that the number of distinct languages accepted by unary finite automata with n states
is $2^n (n - \alpha + O(n 2^{-n/2}))$, where $\alpha \approx 1.3827$. (A weaker result was previously obtained
by Nicaud [40].) Domaratzki [6, 7] gave bounds on the number of minimal DFA's
accepting finite languages, which were improved by Liskovets [36]. Also see [3]. For more
details about enumeration of automata and languages, see the survey of Domaratzki [8].



2 On measuring the size of a regular expression 

Although, as we have seen, there has been much work for over 50 years on enumerating
automata and the languages they represent, the analogous problem for regular expressions
does not seem to have been studied before 2004 [33]. We define $R_k(n)$ to be the number
of distinct languages specified by regular expressions of size n over a k-letter alphabet.
The "size" of a regular expression can be defined in several different ways [11]:

• Ordinary length: total number of symbols, including parentheses, ∅, ε, etc., counted
with multiplicity.

- (0 + 10)*(1 + ε) has ordinary length 12

- Mentioned, for example, in 0] p. 396], [25].

• Reverse polish length: number of symbols in a reverse polish equivalent, including
a symbol • for concatenation. Equivalently, number of nodes in a syntax tree for
the expression.

- (0 + 10)*(1 + ε) in reverse polish would be 010•+*1ε+•

- This has reverse polish length 10

- Mentioned in JSf) 

• Alphabetic width: number of symbols from Σ, counted with multiplicity, not
including ε, ∅, parentheses, operators

- (0 + 10)*(1 + ε) has alphabetic width 4

- Mentioned in [39] [TUl [34]
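The three measures can be computed mechanically. The following is a minimal sketch, not taken from the original: it assumes the expression is written with implicit concatenation, that `e` stands for ε, and that the alphabet letters are `0` and `1` (so the empty-set symbol is not supported). The helper names `to_rpn` and `sizes` are hypothetical.

```python
def to_rpn(expr, letters="01", epsilon="e"):
    """Convert an infix regular expression to reverse polish notation.

    A shunting-yard pass with precedence * > concatenation (.) > +.
    Concatenation is implicit in the input and explicit (".") in the output.
    """
    atoms = set(letters) | {epsilon}
    prec = {"+": 1, ".": 2}
    out, stack = [], []
    prev = None  # previous token, used to detect implicit concatenation
    for tok in expr.replace(" ", ""):
        starts_operand = tok in atoms or tok == "("
        ended_operand = prev is not None and (prev in atoms or prev in (")", "*"))
        if starts_operand and ended_operand:
            # two adjacent operands: insert the concatenation operator
            while stack and stack[-1] != "(" and prec[stack[-1]] >= prec["."]:
                out.append(stack.pop())
            stack.append(".")
        if tok in atoms:
            out.append(tok)
        elif tok == "*":
            out.append(tok)  # postfix unary operator: emit immediately
        elif tok == "+":
            while stack and stack[-1] != "(" and prec[stack[-1]] >= prec["+"]:
                out.append(stack.pop())
            stack.append("+")
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()  # discard the matching "("
        else:
            raise ValueError(f"unexpected symbol {tok!r}")
        prev = tok
    while stack:
        out.append(stack.pop())
    return out


def sizes(expr, letters="01", epsilon="e"):
    """Return (ordinary length, reverse polish length, alphabetic width)."""
    compact = expr.replace(" ", "")
    ordinary = len(compact)
    rpn = len(to_rpn(compact, letters, epsilon))
    alph = sum(ch in letters for ch in compact)
    return ordinary, rpn, alph


print(sizes("(0 + 10)*(1 + e)"))  # (12, 10, 4)
```

The reverse polish form produced for this input is `010.+*1e+.`, which matches the ten-symbol string given in the list above.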

Each size measure seems to have its own advantages and disadvantages. The
ordinary length appears to be the most direct way to measure the size of a regular expression.
Here we can employ the usual priority rules, borrowed from arithmetic, for saving
parentheses and omitting the • operator. This favors the catenation operator • over the union
operator +. For instance, the expression (a • b) + (c • d) can be written more briefly as
ab + cd, which has ordinary length 5, whereas there is no corresponding way to simplify
the expression (a + b)(c + d), which is twice as long. The other two measures are more
robust in this respect. In particular, reverse polish length is a faithful measure for the
amount of memory required to store the parse tree of a regular expression, and alphabetic
width is often used in proofs of upper and lower bounds; compare [23]. A drawback of
alphabetic width is that it may be far from the "real" size of a given regular expression.
As an example, the expression ((ε + ∅)* + ε)* has alphabetic width 0.

However, these three measures are all essentially identical, up to a constant
multiplicative factor. We say "essentially" because one can always artificially inflate the ordinary
length of a regular expression by adding arbitrarily many multiplicative factors of ε,
additive factors of ∅, etc. In order to avoid such trivialities, we define what it means for a
regular expression to be collapsible, as follows:

Definition 2.1. Let E be a regular expression over the alphabet Σ, and let L(E) be the
language specified by E. We say E is collapsible if any of the following conditions hold:

(1) E contains the symbol ∅ and |E| > 1;

(2) E contains a subexpression of the form FG or GF where L(F) = {ε};

(3) E contains a subexpression of the form F + G or G + F where L(F) = {ε} and
ε ∈ L(G).

Otherwise, if none of the conditions hold, E is said to be uncollapsible.

Definition 2.2. If E is an uncollapsible regular expression such that

(1) E contains no superfluous parentheses; and

(2) E contains no subexpression of the form F**,

then we say E is irreducible.

Note that a minimal regular expression for E is uncollapsible and irreducible, but the
converse does not necessarily hold. In [11] the following theorem is proved (cf. [25]).

Theorem 2.1. Let E be a regular expression over Σ. Let |E| denote its ordinary length,
let |rpn(E)| denote its reverse polish length, and let |alph(E)| denote the number of
alphabetic symbols contained in E. Then we have

(a) |alph(E)| ≤ |E|;

(b) If E is irreducible and |alph(E)| ≥ 1, then |E| ≤ 11 · |alph(E)| − 4;

(c) |rpn(E)| ≤ 2 · |E| − 1;

(d) |E| ≤ 2 · |rpn(E)| − 1;

(e) |alph(E)| ≤ ½(|rpn(E)| + 1);

(f) If E is irreducible and |alph(E)| ≥ 1, then |rpn(E)| ≤ 7 · |alph(E)| − 2.
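As a sanity check (a worked illustration, not part of the cited proof), the six bounds can be evaluated on the running example E = (0 + 10)*(1 + ε), for which |E| = 12, |rpn(E)| = 10 and |alph(E)| = 4. We assume here that E is irreducible, which appears to follow from Definitions 2.1 and 2.2, since E contains no ∅, no superfluous parentheses, and no subexpression of the form F**.

```latex
\begin{align*}
\text{(a)}\quad & |\mathrm{alph}(E)| = 4 \le 12 = |E| \\
\text{(b)}\quad & |E| = 12 \le 11 \cdot 4 - 4 = 40 \\
\text{(c)}\quad & |\mathrm{rpn}(E)| = 10 \le 2 \cdot 12 - 1 = 23 \\
\text{(d)}\quad & |E| = 12 \le 2 \cdot 10 - 1 = 19 \\
\text{(e)}\quad & |\mathrm{alph}(E)| = 4 \le \tfrac{1}{2}(10 + 1) = 5.5 \\
\text{(f)}\quad & |\mathrm{rpn}(E)| = 10 \le 7 \cdot 4 - 2 = 26
\end{align*}
```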



3 A simple grammar for valid regular expressions 

As we have seen, if we want to enumerate regular expressions by size, we first have to
agree upon a notion of expression size. But even then there still remains some ambiguity
about the definition of a valid regular expression. For example, does the empty expression,
that is, a string of length zero, constitute a valid regular expression? How about ( ) or
a**? The first two, for example, generate errors in the software package Grail version 2.5
[44]. Surprisingly, very few textbooks, if any, define valid regular expressions properly or
formally. For example, using the definition given in Martin [38, p. 86], the expression 00
is not valid, since it is not fully parenthesized. (To be fair, after the definition it is implied
that parentheses can be omitted in some cases, but no formal definition of when this can
be done is given.) Probably the best way to define valid regular expressions is with a
grammar. We now present an unambiguous grammar for all valid regular expressions:



    S    →  E_+ | E_• | G
    E_+  →  E_+ + F | F + F
    F    →  E_• | G
    E_•  →  E_• G | G G
    G    →  E_* | C | P
    C    →  ∅ | ε | a   (a ∈ Σ)
    E_*  →  G*
    P    →  (S)



This grammar can be proved unambiguous by induction on the size of the regular 
expression generated. The meaning of the variables is as follows: 

S generates all regular expressions

E_+ generates all unparenthesized expressions where the last operator was +

E_• generates all unparenthesized expressions where the last operator was • (implicit
concatenation)

E_* generates all unparenthesized expressions where the last operator was * (Kleene
closure)

C generates all unparenthesized expressions where there was no last operator (i.e., the
constants)

P generates all parenthesized expressions

Here by "parenthesized" we mean there is at least one pair of enclosing parentheses.
Note this grammar allows a**, but disallows (). Once we have an unambiguous
grammar, we can use a powerful tool, the Chomsky-Schützenberger theorem, to enumerate
the number of expressions of size n.
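Because the grammar is unambiguous, the number of valid expressions of a given ordinary length can also be obtained by counting derivations directly. The sketch below is an illustration under that assumption rather than the method developed in the following sections; the function name `count_regexes` is hypothetical, and the alphabet is taken to have k letters plus the two constants ∅ and ε.

```python
def count_regexes(max_len, k):
    """Count valid regular expressions of ordinary length 1..max_len
    over a k-letter alphabet, one recurrence per grammar variable."""
    n = max_len
    S = [0] * (n + 1)    # all expressions
    Ep = [0] * (n + 1)   # last operator is +
    Ed = [0] * (n + 1)   # last operator is (implicit) concatenation
    Es = [0] * (n + 1)   # last operator is *
    F = [0] * (n + 1)
    G = [0] * (n + 1)
    C = [0] * (n + 1)
    P = [0] * (n + 1)
    for m in range(1, n + 1):
        C[m] = k + 2 if m == 1 else 0       # empty set, epsilon, k letters
        Es[m] = G[m - 1]                    # E_* -> G *
        P[m] = S[m - 2] if m >= 2 else 0    # P -> ( S )
        G[m] = Es[m] + C[m] + P[m]
        # E_. -> E_. G | G G, i.e. an F followed by a G
        Ed[m] = sum(F[i] * G[m - i] for i in range(1, m))
        F[m] = Ed[m] + G[m]
        # E_+ -> E_+ + F | F + F; the "+" symbol contributes length 1
        Ep[m] = sum((Ep[i] + F[i]) * F[m - 1 - i] for i in range(1, m - 1))
        S[m] = Ep[m] + Ed[m] + G[m]
    return S[1:]


print(count_regexes(3, 1))  # [3, 12, 60]
```

For a unary alphabet the three expressions of length 1 are ∅, ε and a; the twelve of length 2 are the nine concatenations xy and the three starred constants x*.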



4 Unambiguous context-free grammars and the Chomsky-Schützenberger theorem

Our principal tool for enumerating the number of strings of length n generated by an
unambiguous context-free grammar is the Chomsky-Schützenberger theorem |0]. To state
the theorem, we first recall some basic notions about grammars; these can be found in any
introductory textbook on formal language theory, such as [24].

A context-free grammar is a quadruple of the form G = (V, Σ, P, S), where V is a
nonempty finite set of variables, Σ is a nonempty finite set called the alphabet, P is a finite
subset of V × (V ∪ Σ)* called the productions, and S ∈ V is a distinguished variable
called the start variable. The elements of Σ are often called terminals. A production
(A, γ) is typically written A → γ. A sentential form is an element of (V ∪ Σ)*. Given a
sentential form αAβ, where A ∈ V and α, β ∈ (V ∪ Σ)*, we can apply the production
A → γ to get a new sentential form αγβ. In this case we write αAβ ⇒ αγβ. We write
⇒* for the reflexive, transitive closure of ⇒; that is, we write α ⇒* β if we can
get from α to β by 0 or more applications of ⇒. The language generated by a
context-free grammar is the set of all strings of terminals obtained in 0 or more derivation steps
from S, the start variable. Formally, L(G) = {x ∈ Σ* : S ⇒* x}. A language is
said to be context-free if it is generated by some context-free grammar. Given a sentential
form α derivable from a variable A, we can form a parse tree for α as follows: the root is
labeled A. Every node labeled with a variable B has subtrees with roots labeled, from left
to right, with the elements of γ, where B → γ is a production. A grammar is said to be
unambiguous if for each x ∈ L(G) there is exactly one parse tree for x; otherwise it is said
to be ambiguous. It is known that not every context-free language has an unambiguous
grammar.

Now we turn to formal power series; for more information, see, for example BP . 
A formal power series over a commutative ring R in an indeterminate x is an infinite
sequence of coefficients $(a_0, a_1, a_2, \ldots)$ chosen from R, and usually written
$a_0 + a_1 x + a_2 x^2 + \cdots$. The set of all such formal power series is denoted R[[x]]. The set of all
formal power series is itself a commutative ring, with addition defined term-by-term,
and multiplication defined by the usual Cauchy product as follows: if
$f = a_0 + a_1 x + a_2 x^2 + \cdots$ and $g = b_0 + b_1 x + b_2 x^2 + \cdots$, then $fg = c_0 + c_1 x + c_2 x^2 + \cdots$, where
$c_n = \sum_{i+j=n} a_i b_j$. Exponentiation of formal series is defined, as usual, by iterated
multiplication, so that $f^2 = ff$, for example. A formal power series f is said to be
algebraic (over R(x)) if there exist a finite number of polynomials with coefficients in R,
$r_0(x), r_1(x), \ldots, r_n(x)$, not all zero, such that

\[ r_0(x) + r_1(x) f + \cdots + r_n(x) f^n = 0. \]

The simplest nontrivial examples of algebraic formal series are the rational functions,
which are quotients of polynomials p(x)/q(x). Here is a less trivial example. The
generating function of the Catalan numbers

\[ f(x) = \sum_{n \ge 0} \frac{\binom{2n}{n}}{n+1} x^{n+1} = x + x^2 + 2x^3 + 5x^4 + 14x^5 + 42x^6 + 132x^7 + \cdots \]

is well known [49] to satisfy $f(x) = \frac{1}{2}(1 - \sqrt{1 - 4x})$, and hence we have $f^2 - f + x = 0$.
Thus f(x) is an algebraic (even quadratic!) formal series.
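The Cauchy product and the quadratic relation for the Catalan series can be checked on truncated coefficient lists. This is a small sketch for illustration; the helper name `cauchy_product` and the truncation length `N` are choices made here, not part of the original text.

```python
from math import comb


def cauchy_product(f, g):
    """Coefficients of f*g, truncated to min(len(f), len(g)) terms."""
    n = min(len(f), len(g))
    return [sum(f[i] * g[k - i] for i in range(k + 1)) for k in range(n)]


N = 12
# f(x) = sum_{n>=0} C(2n, n)/(n+1) x^(n+1); index i holds the coefficient of x^i
f = [0] + [comb(2 * n, n) // (n + 1) for n in range(N - 1)]
f_squared = cauchy_product(f, f)
x = [0, 1] + [0] * (N - 2)

# Coefficients of f^2 - f + x; every entry below the truncation order should vanish.
residual = [f_squared[i] - f[i] + x[i] for i in range(N)]
print(f[:8])
print(residual)
```

Truncation is harmless here because the coefficient of x^k in a product only depends on coefficients of index at most k in each factor.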

Now that we have the preliminaries, we can state the Chomsky-Schützenberger theorem:

Theorem 4.1. If L is a context-free language having an unambiguous grammar, and
$a_n := |L \cap \Sigma^n|$, then $\sum_{n \ge 0} a_n x^n$ is a formal power series in $\mathbb{Z}[[x]]$ that is algebraic
over $\mathbb{Q}(x)$.

Furthermore, the equation of which the formal power series is a root can be deduced
as follows: first, we carry out the following replacements:

• Every terminal is replaced by a variable x

• Every occurrence of ε is replaced by the integer 1

• Every occurrence of → is replaced by =

• Every occurrence of | is replaced by +

By doing so, we get a system of algebraic equations, called the "commutative image" of
the grammar, which can then be solved to find a defining equation for the power series.
Oddly enough, Chomsky and Schützenberger did not actually provide a proof of their
theorem. A proof is given by Kuich and Salomaa [32] and, more recently, by Panholzer [41].
Let's look at a simple example. Consider the unambiguous grammar

    S  →  M | U
    M  →  0M1M | ε
    U  →  0S | 0M1U

which represents strings of "if-then-else" clauses. Then this grammar has the following
commutative image:

\[ S = M + U \tag{4.1} \]
\[ M = x^2 M^2 + 1 \tag{4.2} \]
\[ U = Sx + x^2 MU \tag{4.3} \]

This system of equations has the following power series solutions:

\[ M = 1 + x^2 + 2x^4 + 5x^6 + 14x^8 + 42x^{10} + \cdots \]
\[ U = x + x^2 + 3x^3 + 4x^4 + 10x^5 + 15x^6 + 35x^7 + 56x^8 + \cdots \]
\[ S = 1 + x + 2x^2 + 3x^3 + 6x^4 + 10x^5 + 20x^6 + 35x^7 + \cdots \]
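One way to obtain such series solutions numerically is to iterate the commutative image (4.1)-(4.3) on truncated coefficient lists, starting from zero; each pass fixes at least one more coefficient. The following is a sketch of that idea, not the procedure used in the paper, and the helper names `mul`, `shift` and `solve_commutative_image` are hypothetical.

```python
def mul(f, g, n):
    """Cauchy product of two length-n coefficient lists, truncated to n terms."""
    return [sum(f[i] * g[k - i] for i in range(k + 1)) for k in range(n)]


def shift(f, s, n):
    """Multiply a series by x**s, truncated to n terms."""
    return ([0] * s + f)[:n]


def solve_commutative_image(n):
    """Iterate M = 1 + x^2 M^2, U = x S + x^2 M U, S = M + U to a fixed point."""
    one = [1] + [0] * (n - 1)
    S, M, U = [0] * n, [0] * n, [0] * n
    for _ in range(2 * n + 2):  # more passes than strictly needed
        M = [a + b for a, b in zip(one, shift(mul(M, M, n), 2, n))]
        U = [a + b for a, b in zip(shift(S, 1, n), shift(mul(M, U, n), 2, n))]
        S = [a + b for a, b in zip(M, U)]
    return S, M, U


S, M, U = solve_commutative_image(11)
print(S[:8])  # [1, 1, 2, 3, 6, 10, 20, 35]
print(M)      # [1, 0, 1, 0, 2, 0, 5, 0, 14, 0, 42]
print(U[:9])  # [0, 1, 1, 3, 4, 10, 15, 35, 56]
```

The iteration converges because every occurrence of a variable on the right-hand side of (4.2) and (4.3) is multiplied by a positive power of x.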



By the Chomsky-Schützenberger theorem, each variable satisfies an algebraic
equation over Q(x). We can solve the system above to find the equation for S, as follows: first,
we solve (4.3) to get $U = \frac{Sx}{1 - x^2 M}$, and substitute back in (4.1) to get $S = M + \frac{Sx}{1 - x^2 M}$.
Multiplying through by $1 - x^2 M$ gives $S - x^2 MS = M - x^2 M^2 + Sx$, which, by (4.2),
is equivalent to $S - x^2 MS = 1 + Sx$. Solving for S, we get $S = \frac{1}{1 - x^2 M - x}$. Now
(whatever M and x are) we have

\[ (1 - x^2 M - x)^2 = x^2 (1 - M + x^2 M^2) - x(2x - 1) - (2x - 1)(1 - x^2 M - x), \]

so, since $1 - M + x^2 M^2 = 0$ by (4.2), we get $S^{-2} = -x(2x - 1) - (2x - 1) S^{-1}$ and hence

\[ x(2x - 1) S^2 + (2x - 1) S + 1 = 0. \]

This is an equation for S.
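The quadratic equation can be checked against the listed coefficients of S. The sketch below substitutes the first eight coefficients given above into x(2x − 1)S² + (2x − 1)S + 1 and confirms that the result vanishes up to the truncation order; `poly_mul` is a helper name introduced here.

```python
def poly_mul(f, g, n):
    """Product of two coefficient lists, keeping only terms of degree < n."""
    out = [0] * n
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if i + j < n:
                out[i + j] += a * b
    return out


S = [1, 1, 2, 3, 6, 10, 20, 35]   # coefficients of S listed in the text
n = len(S)
S_squared = poly_mul(S, S, n)
a = [0, -1, 2]                    # x(2x - 1) = -x + 2x^2
b = [-1, 2]                       # 2x - 1
one = [1] + [0] * (n - 1)

residual = [p + q + r for p, q, r in
            zip(poly_mul(a, S_squared, n), poly_mul(b, S, n), one)]
print(residual)  # [0, 0, 0, 0, 0, 0, 0, 0]
```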
5 Solving algebraic equations using Gröbner bases

Before introducing the notion of Gröbner bases, we describe some of the relevant
mathematical notions from the field of commutative algebra. The exposition here is
impressionistic; readers familiar with algebraic geometry will have no difficulty reformulating
it in more formalized terms. For readers seeking a more thorough introduction to
the topic, there are accessible textbooks at the undergraduate level, such as [5]; a standard
graduate level textbook is [22].

We recall that a field k is a commutative ring with the additional property that
multiplicative inverses exist. That is, for any non-zero a ∈ k, there exists an element b such
that ab = ba = 1; more informally, one can "divide by a". Familiar examples of fields are
the rational numbers ℚ, the real numbers ℝ, and the complex numbers ℂ. On the other
hand, the commutative ring ℤ of integers is not a field, and the smallest field containing it
is ℚ.

For our application to the asymptotic enumeration of regular languages, we are inter- 
ested in the commutative ring of formal power series Z[[x]]. This is not a field, but rather 
only a ring — note, for example, that the element 2x does not have a multiplicative in- 
verse. For the purposes of our algebraic framework it is convenient to work with the field 
k = Q((x)) of formal Laurent series over Q. A formal Laurent series is defined similarly 
to a formal power series, with the difference that finitely many negative exponents are 
allowed; an example is 




The following discussion holds for any field k, but for intuition, the reader may prefer to
think of k = ℝ.

Given any field k and indeterminates $X_1, X_2, \ldots, X_n$, there are two important
objects:

• the n-dimensional vector space $W = k^n$ over k, with coordinates $X_i$ ($1 \le i \le n$);
and

• the ring $R = k[X_1, X_2, \ldots, X_n]$ of (multivariate) polynomials over k in n
indeterminates.

For instance, taking k = Q((x)), the polynomial $Sx + x^2 MU - U$, which we used in the
previous section in Equation (4.3), is a member of the ring k[S, M, U]. The corresponding
vector space W has coordinates S, M, and U. Notice that x is not a coordinate of W, but
an artifact originating from the way the members of k are defined.

Given any collection of polynomials $\mathcal{F}$ in R, we can define their vanishing set $V(\mathcal{F})$
to be the set of common solutions in W; that is, all points $(x_1, x_2, \ldots, x_n) \in W$ such that

\[ f(x_1, x_2, \ldots, x_n) = 0 \quad \text{for all } f \in \mathcal{F}. \]

As an example, let $W = \mathbb{R}^3$, with coordinates X, Y, Z. Then, the vanishing set of the set
of polynomials $\mathcal{F} = \{X, Y + 3, Z + Y - 2\}$ is the single point given by (X, Y, Z) =
(0, −3, 5); the vanishing set of the single polynomial $Z - X^2 - Y^2$ is an upward-opening
paraboloid.

The ideal $\langle \mathcal{F} \rangle$ generated by a collection $\mathcal{F}$ of polynomials is the set of all R-linear
combinations of $\mathcal{F}$; that is, all polynomials of the form

\[ p_1 f_1 + p_2 f_2 + \cdots + p_t f_t \quad \text{where } p_i \in R,\ f_i \in \mathcal{F} \text{ for all } i. \]

Observe that the vanishing sets of a collection of polynomials and their generated ideal
are equal: $V(\mathcal{F}) = V(\langle \mathcal{F} \rangle)$.

A term ordering on R is a total order ≺ on the set of monomials (disregarding
coefficients) of R satisfying

• multiplicativity: if u, v, w are any monomials in R, then u ≺ v implies wu ≺ wv;

• well-ordering: if $\mathcal{F}$ is a nonempty collection of monomials, then $\mathcal{F}$ has a smallest element
under ≺.

Once a term ordering has been defined, one can then define the notion of the leading
term of a polynomial, similar to the univariate case. For example, one defines the pure
lexicographic order on k[X, Y, Z] given by Z ≺ Y ≺ X to be the ordering where
$X^a Y^b Z^c \prec X^d Y^e Z^f$ if and only if (a, b, c) < (d, e, f) lexicographically. With this
ordering, an example of a polynomial with its monomials in decreasing order is

\[ X^3 + X^2 Y + X^2 Z^7 + Y^9 + 1; \]

its leading term is $X^3 = X^3 Y^0 Z^0$, and its trailing terms are $X^2 Y$, $X^2 Z^7$, $Y^9$ and 1.
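The pure lexicographic order is easy to experiment with if a monomial X^a Y^b Z^c is stored as its exponent tuple (a, b, c), since Python compares tuples lexicographically. The snippet below is an illustrative sketch; the representation and the helper `show` are choices made here.

```python
# Monomial X^a Y^b Z^c is stored as the exponent tuple (a, b, c).
monomials = [(0, 0, 0), (2, 0, 7), (0, 9, 0), (3, 0, 0), (2, 1, 0)]

# Tuple comparison is lexicographic, which is the order with Z < Y < X.
decreasing = sorted(monomials, reverse=True)
leading, trailing = decreasing[0], decreasing[1:]


def show(m):
    """Render an exponent tuple as a plain-text monomial."""
    parts = [v if e == 1 else f"{v}^{e}" for v, e in zip("XYZ", m) if e > 0]
    return " ".join(parts) or "1"


print(" + ".join(show(m) for m in decreasing))
# X^3 + X^2 Y + X^2 Z^7 + Y^9 + 1


def times(w, u):
    """Multiply two monomials by adding exponents."""
    return tuple(a + b for a, b in zip(w, u))


# Multiplicativity: u < v implies wu < wv for any monomial w.
w = (1, 4, 2)
ordered_after_multiplying = all(
    times(w, u) < times(w, v)
    for u in monomials for v in monomials if u < v
)
```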

Given an ideal I, a Gröbner basis B for I is a set of polynomials $g_1, g_2, \ldots, g_k$ in I such
that the ideal generated by the leading terms of the $g_i$ is precisely the initial ideal of I,
defined to be the ideal generated by the leading terms of all polynomials in I. It can be shown that B generates
I. Furthermore, we say that B is a reduced Gröbner basis if

• the coefficient of each leading term in B is 1;

• the leading terms of B are a minimal set of generators for the initial ideal of I; and

• no trailing terms of B appear in the initial ideal of I.

Once a term order has been chosen, reduced Gröbner bases are unique. Note that in
general, there are many term orderings for a polynomial ring R; the computational difficulty
of a computation involving Gröbner bases is often highly sensitive to the choice of term
ordering used.

Having established these preliminaries, we turn our attention to solving a system
of equations given by the commutative image of a context-free grammar. Suppose we
have a context-free grammar in the non-terminals $S, N_1, N_2, \ldots, N_n$. For each
non-terminal N, let $f_N$ denote the generating function enumerating the language
generated by N. Taking k to be the field of formal Laurent series Q((x)), the
Chomsky-Schützenberger theorem implies $f_N \in k$ for every non-terminal N. Furthermore, by
taking the commutative image of the context-free grammar, we obtain a sequence of
polynomials $p_S, p_{N_1}, \ldots, p_{N_n}$, where for every non-terminal N, the polynomial relation
$p_N = 0$ is the commutative image of the derivation rule for N. Note that every such polynomial
is in the polynomial ring $(\mathbb{Z}[x])[S, N_1, N_2, \ldots, N_n]$.

It follows from the definitions that for every non-terminal N,

\[ p_N(f_S, f_{N_1}, f_{N_2}, \ldots, f_{N_n}) = 0; \]

that is, the (n + 1)-tuple $(f_S, f_{N_1}, f_{N_2}, \ldots, f_{N_n})$ is a zero of the polynomial $p_N$. Since
this holds for every non-terminal N, we can equivalently say that $(f_S, f_{N_1}, f_{N_2}, \ldots, f_{N_n})$
is in the vanishing set V(I), where I is the ideal generated by the polynomials $p_S, p_{N_1}, p_{N_2}, \ldots, p_{N_n}$.

Our aim is to determine an algebraic equation satisfied by the power series $f_S$. To do
this, we find a Gröbner basis B for I, using an elimination ordering on the indeterminate
S. The defining property of any such term ordering is that the monomials involving only
the indeterminate S are strictly smaller than the other monomials; namely, those involving
at least one of $N_1, N_2, \ldots, N_n$. By the Chomsky-Schützenberger theorem and the
properties of Gröbner bases, the smallest polynomial p in B will be a univariate polynomial
in the indeterminate S. Since p ∈ I, and $(f_S, f_{N_1}, f_{N_2}, \ldots, f_{N_n})$ is in the vanishing set
V(I), we see that $p(f_S) = 0$; that is, p = 0 is an algebraic equation satisfied by $f_S$. (Note
that in previous sections, we simply use S to denote $f_S$.)

As an example, we use Maple 13 to compute such an algebraic equation for the
example grammar in the previous section. We give the commands, followed by the produced
output. The commutative image of the grammar is entered as a list of polynomials, given
by

> eqs := [-S + M + U, -M + x^2*M^2 + 1, -U + S*x + x^2*M*U];

eqs := [-S + M + U, -M + x^2 M^2 + 1, -U + S x + x^2 M U].

Maple provides an elimination ordering called lexdeg; to compute a reduced Gröbner
basis using this ordering, we enter the command

> Groebner[Basis](eqs, lexdeg([M, U], [S]));

[1 + (-1 + 2x) S + (-x + 2x^2) S^2, 1 + (-1 + x) S + U x, -1 + (1 - 2x) S + M x].

The algebraic equation satisfied by S is the first polynomial in this set:

> algeq := %[1];

algeq := 1 + (-1 + 2x) S + (-x + 2x^2) S^2.

To compute the Laurent series zeros of this polynomial, we solve for S and
expand the solutions as Laurent series in the indeterminate x:

> map(series, [solve(algeq, S)], x);

[(-x^(-1) - 1 - x - 2x^2 - 3x^3 - 6x^4 - 10x^5 + O(x^6)), (1 + x + 2x^2 + 3x^3 + 6x^4 + 10x^5 + O(x^6))].

Our desired power series solution is the second entry in the above returned list.
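The two series can also be recovered without a computer algebra system. Rearranging the algebraic equation as S = 1/(1 − 2x) − xS² gives a fixed-point iteration for the power series root, and since the two roots of the quadratic sum to −(2x − 1)/(x(2x − 1)) = −1/x, the other root is −1/x − S. The Python sketch below illustrates this rearrangement; it is not what Maple does internally, and `power_series_root` is a hypothetical name.

```python
def power_series_root(n):
    """First n coefficients of the power series root of
    1 + (2x - 1) S + (2x^2 - x) S^2 = 0, via S = 1/(1 - 2x) - x S^2."""
    geometric = [2 ** k for k in range(n)]          # 1/(1 - 2x)
    S = [0] * n
    for _ in range(n):                              # one new coefficient per pass
        square = [sum(S[i] * S[k - i] for i in range(k + 1)) for k in range(n)]
        S = [geometric[k] - (square[k - 1] if k >= 1 else 0) for k in range(n)]
    return S


S = power_series_root(6)
print(S)  # [1, 1, 2, 3, 6, 10]

# The other root is -1/x - S: its coefficients from x^(-1) upwards.
other_root = [-1] + [-c for c in S]
print(other_root)  # [-1, -1, -1, -2, -3, -6, -10]
```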



6 Asymptotic bounds via singularity analysis 

If L is a context-free language having an unambiguous grammar and $f(x) = \sum_n a_n x^n$ is
the formal power series enumerating it, then f(x) is algebraic over Q(x) by Theorem 4.1.
The previous section gave a procedure for computing an algebraic equation satisfied by
f; that is, we are able to determine a non-trivial polynomial $P(x, S) \in \mathbb{Z}[x, S]$ such that
P(x, f(x)) = 0. This section describes how singularity analysis can be used to determine
the asymptotic growth rate of the coefficients $a_n$. We sketch some of the requisite notions
from complex analysis and provide a glimpse of the underlying theory; more details can
be found in Flajolet and Sedgewick [12].

The usefulness of complex analysis here is that the formal power series f(x),
defined purely combinatorially, can be viewed as a function defined on an appropriate
open subset of the complex plane ℂ. Such a function is called holomorphic or (complex)
analytic; this reinterpretation of f(x) allows us to apply theorems from complex analysis
in order to derive bounds on the asymptotic growth rate of the $a_n$ far tighter than what we
could do with purely combinatorial reasoning.

Indeed, assume that L is an infinite context-free language; then there exists a real
number $0 < R \le 1$ called the radius of convergence for f(x). The defining properties of
R are that:

• if z is a complex number with |z| < R, then the infinite sum $a_0 + a_1 z + a_2 z^2 + a_3 z^3 + \cdots$
converges; and

• if z is a complex number with |z| > R, then the infinite sum $a_0 + a_1 z + a_2 z^2 + a_3 z^3 + \cdots$
diverges.

We note that the definition says nothing about the convergence of $\sum_i a_i z^i$ when |z| = R.
Thus, defining U to be the open ball of complex numbers z satisfying |z| < R, we can
reinterpret f as an analytic function on U. The connection between the asymptotic growth
of the coefficients $a_n$ and the number R is given by two theorems.

Theorem 6.1 (Hadamard). Given any power series, R is given by the explicit formula:

\[ R = \frac{1}{\limsup_{n \to \infty} |a_n|^{1/n}}. \]

The defining properties of lim sup state that

• for any ε > 0, the relation $|a_n|^{1/n} < \frac{1}{R} + \varepsilon$ holds for sufficiently large n; and

• for any ε > 0, the relation $|a_n|^{1/n} > \frac{1}{R} - \varepsilon$ holds for infinitely many n.

For our situation in particular, this implies that up to a sub-exponential factor, $a_n$ grows
asymptotically like $1/R^n$. (This implies that for any ε > 0, we have $a_n \in O((\frac{1}{R} + \varepsilon)^n)$
and $a_n \notin O((\frac{1}{R} - \varepsilon)^n)$.)

We note that Hadamard's formula applies to any power series, not just to generating
functions of context-free languages.
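Hadamard's formula can be observed numerically on the Catalan series of Section 4. Its closed form ½(1 − √(1 − 4x)) is singular at x = 1/4, so one would expect |a_n|^{1/n} to approach 4, slowly, because of the sub-exponential factor. The sketch below is only an illustration; the function names are hypothetical.

```python
from math import comb, exp, log


def catalan_coefficient(n):
    """Coefficient a_n of x^n in f(x) = sum_{m>=0} C(2m, m)/(m+1) x^(m+1)."""
    if n == 0:
        return 0
    m = n - 1
    return comb(2 * m, m) // (m + 1)


def nth_root_estimate(n):
    """|a_n|^(1/n), computed through logarithms to avoid float overflow."""
    return exp(log(catalan_coefficient(n)) / n)


estimates = {n: nth_root_estimate(n) for n in (10, 100, 1000)}
for n, value in estimates.items():
    print(n, round(value, 3))
```

The estimates increase towards 4 from below, which is consistent with R = 1/4 for this series.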

An elementary argument shows that our assumption that L is infinite implies R ≤ 1;
similarly, our assumption that L is context-free (and thus f is algebraic) implies R > 0. (The
argument for showing R > 0 is harder, and is sketched here for those familiar with
complex analysis. The algebraic curve given by P(z, y) = 0 determines d branches
around z = 0 and the power series $f(x) = \sum_n a_n x^n$ must be associated with one such
branch. Since the exponents of f(x) are non-negative integers, this must be an analytic
branch at 0; hence, f(x) determines an analytic function at 0 and must have positive
radius of convergence.)

The second theorem describes the convergence of the power series f(x) on the circle
given by |z| = R. A dominant singularity for f(x) is a point $z_0$ on this circle such that
the sum $\sum_n a_n z_0^n$ diverges; the following result says that a positive (real-valued) dominant
singularity always exists.

Theorem 6.2 (Pringsheim). Let $f(x) = \sum_n a_n x^n$ be a power series with radius of
convergence R > 0. If the coefficients $a_n$ are all non-negative, then R is a dominant
singularity for f(x).

The benefit of Pringsheim's theorem is that, for the sake of determining R, it suffices
to examine the positive real line for the singularities of f(x) considered as a function, not
just as a power series. We make this more precise now, by introducing the concept of a
multi-valued function.

Suppose that the power series f(x) is algebraic of degree d over Q(x); under the
assumption that P is irreducible, this means that the degree of the polynomial $P(x, S) \in \mathbb{Z}[x, S]$
in the variable S is d, and we may write

\[ P = q_d S^d + q_{d-1} S^{d-1} + q_{d-2} S^{d-2} + \cdots + q_0, \]

where each $q_i$ is a polynomial in Z[x] and $q_d$ is non-zero. (If P is reducible, factor it and
replace it by an appropriate irreducible factor.)

If we work in the algebraically closed Puiseux series field $\bigcup_{n \ge 1} \mathbb{C}((x^{1/n}))$, we obtain
d roots of P(x, S) = 0, say, $g_1(x), g_2(x), \ldots, g_d(x)$, one of which coincides with f(x).
In general, these roots will not be power series with non-negative integer coefficients, but
instead will be more generalized power series with complex coefficients and (possibly
negative) fractional exponents.

Let $D(x) \in \mathbb{Z}[x]$ be the discriminant of P with respect to the variable S; this is readily
computed via the formula

\[ D = \frac{(-1)^{d(d-1)/2}}{q_d} \operatorname{Res}\left(P, \frac{\partial P}{\partial S}, S\right). \]

Here, Res denotes the resultant of two polynomials, defined to be the determinant of a
matrix whose entries are given by the coefficients of the polynomials. The theoretical
importance of D is that it satisfies the identity

\[ D(x) = q_d^{2d-2} \prod_{i < j} \left(g_i(x) - g_j(x)\right)^2. \]

Define the exceptional set Ξ of P to be the complex zeros of D; note that this is a finite
set. For every point z in the complement ℂ \ Ξ, where D does not vanish, there exist
d distinct solutions y to the equation P(z, y) = 0. Furthermore, the d distinct solutions
vary continuously with z, and a locally continuous choice of solutions locally determines a
branch (which is locally an analytic function) of the algebraic curve cut out by P(z, y) =
0; this is how a multi-valued function arises.

On the open set U, which we have defined to be the set of points z satisfying |z| < R,
one such branch is given by our initial power series f(x). By Pringsheim's theorem,
f(x) diverges at R; this shows that f(x), considered as an analytic function on U, has no
analytic continuation to a function on an open set containing U ∪ {R}. According to the
discussion above, this shows that R must be in the exceptional set Ξ.

We have given a method to calculate an upper bound for the growth rate of the $a_n$; in
particular, we have shown parts (1) and (2) of:

Theorem 6.3. Let $f(x) = \sum_n a_n x^n$ be a formal power series where $a_n \ge 0$ for each n.
Suppose P(x, S) = 0 is a non-trivial algebraic equation satisfied by f(x), and let D be
the discriminant of P with respect to S. Then, exactly one of the positive real roots R of
D satisfies the following properties:

(1) for any ε > 0, $a_n \in O((\frac{1}{R} + \varepsilon)^n)$;

(2) for any ε > 0, $a_n \notin O((\frac{1}{R} - \varepsilon)^n)$; and

(3) if D has no zero $z_0 \ne R$ such that $|z_0| = R$, then for any ε > 0, $a_n \in \Omega((\frac{1}{R} - \varepsilon)^n)$.

We remark that part (3) is much more difficult to show; it is implied by the stronger
result that if D has no zero $z_0 \ne R$ such that $|z_0| = R$, then there exists a polynomial p
such that a n ~ • (^) . 

Given the list $\rho_1 < \rho_2 < \cdots < \rho_k$ of positive real elements of the exceptional
set $S$, there remains the task of selecting which $\rho_j$ to use to provide an upper or lower
bound. The bigger $j$ is, the better our upper bound will be; however, for this bound to be
valid, we must ensure that $\rho_j \le R$. For our purposes, we simply employ a boot-strapping
method: if it is known beforehand that $a_n \in O(s^n)$ for some $s$, then we simply choose the
minimal $j$ such that $1/\rho_j \le s$; equivalently, $\rho_j \ge 1/s$. If this is not possible, we
simply pick $j = 1$. Similarly, for a lower bound, we choose the maximal $j$ such that
$\rho_j \le 1/t$ if it is known that $a_n \in \Omega(t^n)$. (With much more work, one can precisely
identify $R$; Flajolet and Sedgewick [12] describe an algorithm "Algebraic Coefficient
Asymptotics" that does this.)
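The selection rule can be sketched in a few lines of Python. This is only an illustration under the reading that a known bound $a_n \in O(s^n)$ forces $R \ge 1/s$, so that the smallest candidate root at or above $1/s$ cannot exceed $R$; the function names `pick_upper_root` and `pick_lower_root` are hypothetical and do not come from the authors' worksheets.

```python
def pick_upper_root(rhos, s=None):
    """Choose a root rho_j for an upper bound O((1/rho_j + eps)^n).

    rhos: positive real roots of the discriminant (candidates for R).
    s:    optional known bound a_n in O(s^n); None means nothing is known.
    """
    rhos = sorted(rhos)
    if s is not None:
        for rho in rhos:          # minimal j with rho_j >= 1/s
            if rho >= 1.0 / s:
                return rho
    return rhos[0]                # fall back to j = 1


def pick_lower_root(rhos, t):
    """Choose the maximal rho_j <= 1/t, given a known bound a_n in Omega(t^n)."""
    candidates = [rho for rho in sorted(rhos) if rho <= 1.0 / t]
    return candidates[-1] if candidates else None
```

For example, with candidate roots 0.2, 0.5 and 0.9 and a known bound $O(4^n)$, the sketch picks 0.5, giving growth rate at most $1/0.5 = 2$.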

As an illustration, we continue the Maple example in the previous section to derive an
asymptotic upper bound for the example grammar. We first recall the algebraic equation
satisfied by S:

> algeq;

1 + (-1 + 2x) S + (-x + 2x^2) S^2 .

We compute the discriminant D:

> d := discrim(algeq, S);

d := -(2x + 1)(-1 + 2x) .

The real roots of D are given by:

> realroots := [fsolve(%)];

realroots := [-0.5000000000, 0.5000000000] .

Finally, an upper bound is given by taking the inverse of the smallest positive real root:

> 1/min(op(select(type, realroots, positive)));

2.000000000 .

Hence, $a_n \in O((2 + \varepsilon)^n)$ for any $\varepsilon > 0$.
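The same computation can be reproduced without Maple. The sketch below is an assumption-laden illustration, not the authors' worksheet: it handles only equations that are quadratic in S, represents polynomials in x as coefficient lists (lowest degree first), and locates the smallest positive root by a grid scan followed by bisection. All helper names are hypothetical.

```python
def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_sub(p, q):
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a - b for a, b in zip(p, q)]


def quadratic_discriminant(c0, c1, c2):
    """Discriminant c1^2 - 4*c2*c0 of c0 + c1*S + c2*S^2 with respect to S."""
    return poly_sub(poly_mul(c1, c1), [4 * t for t in poly_mul(c2, c0)])


def poly_eval(p, x):
    value = 0
    for coeff in reversed(p):   # Horner evaluation
        value = value * x + coeff
    return value


def smallest_positive_root(p, upper=10, grid=1000):
    """Scan (0, upper] on a grid, then bisect the first sign change.

    Roots of even multiplicity inside one grid cell would be missed.
    """
    prev_x, prev_v = 0.0, poly_eval(p, 0.0)
    for i in range(1, upper * grid + 1):
        x = i / grid
        v = poly_eval(p, x)
        if v == 0:
            return x
        if prev_v * v < 0:
            lo, hi = prev_x, x
            for _ in range(80):
                mid = (lo + hi) / 2
                if poly_eval(p, lo) * poly_eval(p, mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            return (lo + hi) / 2
        prev_x, prev_v = x, v
    return None


# 1 + (-1 + 2x) S + (-x + 2x^2) S^2
disc = quadratic_discriminant([1], [-1, 2], [0, -1, 2])
root = smallest_positive_root(disc)
bound = 1 / root
```

Here `disc` is the coefficient list of $1 - 4x^2$, which agrees with the Maple output $-(2x+1)(-1+2x)$.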

7 Lower bounds on enumeration of regular languages by regular expressions

We now turn to lower bounds on $R_k(n)$. In the unary case ($k = 1$), we can argue as
follows: consider any subset of $\{\varepsilon, a, a^2, \ldots, a^{t-1}\}$. Such a subset can be
denoted by a regular expression of (ordinary) length at most $t(t+1)/2$. Since there are $2^t$
distinct subsets, this gives a lower bound of $R_1(n) \ge 2^{\sqrt{2n}-1}$. Similarly, when
$k \ge 2$, there are $k^n$ distinct strings of length $n$, so $R_k(n) \ge k^n$. These naive bounds
can be improved somewhat using a grammar-based approach.
Consider a regular expression of the form

$w_1(\varepsilon + w_2(\varepsilon + w_3(\varepsilon + \cdots)))$

where the $w_i$ denote nonempty words. Every distinct choice of the $w_i$ specifies a distinct
language. Such expressions can be generated by the grammar

S → Y | Y(ε + S)
Y → aY | a,   a ∈ Σ

which has the commutative image

$S = Y + Y S x^4$
$Y = kxY + kx .$

The solution to this system is

$S = \frac{kx}{1 - kx - kx^5} .$

Once again, the asymptotic behavior of the coefficients of the power series for $S$ depends
on the zeros of $1 - kx - kx^5$. The smallest (indeed, the only) real root is, asymptotically
as $k \to \infty$, given by

$\sum_{t \ge 0} \frac{(-1)^t \binom{5t}{t}}{4t+1}\, k^{-(4t+1)} = \frac{1}{k} - \frac{1}{k^5} + \frac{5}{k^9} - \frac{35}{k^{13}} + \cdots$

The reciprocal of this series is

$k + \sum_{t \ge 0} \frac{(-1)^t\, 4 \binom{5t+5}{t+1}}{5(5t+4)}\, k^{-(4t+3)} = k + \frac{1}{k^3} - \frac{4}{k^7} + \frac{26}{k^{11}} - \frac{204}{k^{15}} + \frac{1771}{k^{19}} - \cdots$

For $k = 1$ the only real root of $1 - kx - kx^5$ is approximately .754877666 and for $k = 2$
it is about .4756527435. Thus we have

Theorem 7.1. $R_1(n) = \Omega(1.3247^n)$ and $R_2(n) = \Omega(2.102374^n)$.
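The numerical values quoted above are easy to re-derive. The following sketch (function names are illustrative, not taken from the paper) finds the real root of $1 - kx - kx^5$ in $[0, 1]$ by bisection and, as a cross-check, computes the coefficients of $kx/(1 - kx - kx^5)$ from the linear recurrence $s_m = k s_{m-1} + k s_{m-5}$ with $s_1 = k$.

```python
def smallest_root(k, tol=1e-12):
    """Real root of 1 - k*x - k*x**5 in [0, 1]; the function is decreasing there."""
    f = lambda x: 1 - k * x - k * x ** 5
    lo, hi = 0.0, 1.0            # f(0) = 1 > 0 and f(1) = 1 - 2k < 0 for k >= 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def coefficients(k, n):
    """Coefficients s_0..s_n of the power series k*x / (1 - k*x - k*x**5)."""
    s = [0] * (n + 1)
    for m in range(1, n + 1):
        s[m] = (k if m == 1 else 0) + k * s[m - 1]
        if m >= 5:
            s[m] += k * s[m - 5]
    return s


growth_unary = 1 / smallest_root(1)    # about 1.3247
growth_binary = 1 / smallest_root(2)   # about 2.102374
```

The ratio of consecutive coefficients approaches the reciprocal of the root, which is the growth rate appearing in Theorem 7.1.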



7.1 Trie representations for finite languages 

We will now improve these lower bounds. To this end, we begin with the simpler problem
of counting the number of finite languages that may be specified by regular expressions
without Kleene star of size n. Non-empty finite languages not containing ε admit a
standard representation via a trie structure; an example is given in Fig. 1(a).

[Figure 1, two panels: (a) Representing the finite language 01(2 + 34 + 5) + 67(ε + 8) as a
trie. (b) Representing the infinite language 01*(2 + 34* + 5) + 67*(ε + 8) as a starred trie.]

Figure 1. Example of a trie representation for a finite language (see Section 7.1) and
of a starred trie representation for an infinite language (see Section 7.2).

The words in such a language L correspond to the leaf nodes of the trie for L; moreover,
the concatenation of labels from the root to a leaf node gives an expression for the
word associated with that leaf node. For regular languages L and M, we write $M^{-1}L$ to
denote the left quotient of L by M; formally

$M^{-1}L = \{v : \text{there exists } u \in M \text{ such that } uv \in L\}.$

If M consists of a single word w, we also write $w^{-1}L$ instead of $\{w\}^{-1}L$, and $w^{-n}L$
instead of $(w^n)^{-1}L$.
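For finite languages the left quotient can be computed directly from the definition. The snippet below is a minimal sketch that models languages as Python sets of strings; the name `left_quotient` is chosen here for illustration.

```python
def left_quotient(M, L):
    """Return M^{-1}L = {v : there is u in M with uv in L} for finite sets of strings."""
    return {w[len(u):] for u in M for w in L if w.startswith(u)}


# The finite language of Fig. 1(a): 01(2 + 34 + 5) + 67(ε + 8)
L = {"012", "0134", "015", "67", "678"}
after_0 = left_quotient({"0"}, L)     # words that may follow the letter 0
after_67 = left_quotient({"67"}, L)   # contains the empty word, since 67 is in L
```

Quotienting by a single letter is exactly the step that moves from a trie node to one of its children.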

For notational convenience, we take our alphabet to be $\Sigma = \{a_0, a_1, \ldots, a_{k-1}\}$,
where $k > 1$ denotes our alphabet size. A trie encodes the simple fact that each nonempty
finite language L not containing ε can be uniquely decomposed as $L = \bigcup_i a_i L_i$, where
$L_i = a_i^{-1}L$, and the index $i$ runs over all symbols $a_i \in \Sigma$ such that $L_i$ is nonempty.
This factoring out of common prefixes resembles Horner's rule (see e.g. [26, p. 486]) for
evaluating polynomials. We develop lower bounds by specifying a context-free grammar
that generates regular expressions with common prefixes factored out. In fact, the grammar
is designed so that if r is a regular expression generated by the grammar, then the
structure of r mimics that of the trie for L(r): nodes with a single child correspond to
concatenations, while nodes with multiple children correspond to concatenations with a
union; see Table 1.





    S   →  Y | Z
    E   →  Y | (Z) | (ε + S)
    Y   →  P_i                                  for 0 ≤ i < k
    Z   →  P_{n_0} + P_{n_1} + ··· + P_{n_t}    where 0 ≤ n_0 < n_1 < ··· < n_t < k for t > 0
    P_i →  a_i | a_i E                          for 0 ≤ i < k

Table 1. A grammar for mimicking tries with regular expressions.



The set of regular languages represented corresponds to all non-empty finite languages
over Σ not containing the empty string ε. We briefly describe the non-terminals:

S generates all non-empty finite languages not containing ε.
E generates all non-empty finite languages containing at least one word other than ε.
Y generates all non-empty finite languages (not containing ε) whose words all begin with
  the same letter. The for loop is executed only once.
Z generates all non-empty finite languages (not containing ε) whose words do not all
  begin with the same letter.
P_i generates all non-empty finite languages (not containing ε) whose words all begin
  with a_i.

    k    ordinary        reverse polish    alphabetic
    1    Ω(1.3247^n)     Ω(1.2720^n)       Ω(2^n)
    2    Ω(2.5676^n)     Ω(2.1532^n)       Ω(6.8284^n)
    3    Ω(3.6130^n)     Ω(2.7176^n)       Ω(11.1961^n)
    4    Ω(4.6260^n)     Ω(3.1806^n)       Ω(15.5307^n)
    5    Ω(5.6264^n)     Ω(3.5834^n)       Ω(19.8548^n)
    6    Ω(6.6215^n)     Ω(3.9451^n)       Ω(24.1740^n)

Table 2. Lower bounds for R_k(n) with respect to size measure and alphabet size.

We remark that this grammar is unambiguous and that no regular language is repre- 
sented more than once; this should be clear from the relationship between regular expres- 
sions generated by the grammar and their respective tries. 
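To make the correspondence between tries and grammar-generated expressions concrete, here is a small sketch that prints the factored expression for a finite language given as a set of strings. It follows the shape of the productions S, E and P_i as read from Table 1; the function names are hypothetical, and letters are assumed to be single characters ordered by Python's default string ordering.

```python
def group_by_first_letter(lang):
    """Split a language without the empty word into {a: a^{-1}L}."""
    groups = {}
    for w in lang:
        groups.setdefault(w[0], set()).add(w[1:])
    return groups


def expr_S(lang):
    """S -> Y | Z: a union of branches P_i, one per distinct first letter."""
    groups = group_by_first_letter(lang)
    return " + ".join(expr_P(a, groups[a]) for a in sorted(groups))


def expr_P(a, suffixes):
    """P_i -> a_i | a_i E."""
    return a if suffixes == {""} else a + expr_E(suffixes)


def expr_E(lang):
    """E -> Y | (Z) | (ε + S); lang contains at least one non-empty word."""
    if "" in lang:
        return "(ε + " + expr_S(lang - {""}) + ")"
    body = expr_S(lang)
    single_branch = len({w[0] for w in lang}) == 1
    return body if single_branch else "(" + body + ")"


example = expr_S({"012", "0134", "015", "67", "678"})
```

For the language of Fig. 1(a) this yields `01(2 + 34 + 5) + 67(ε + 8)`, and each language produces exactly one such string, which is the uniqueness property used for the lower bound.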

(Note that it is possible to slightly optimize this grammar in the case of ordinary length
to generate expressions such as 0 + 00 in lieu of 0(ε + 0), but as it results in marginal
improvements to the lower bound at the cost of greatly complicating the grammar, we do
not do so here.)

Table 2 lists the lower bounds obtained through this grammar. In this table (and only
this table), each $\Omega(k^n)$ in the column corresponding to reverse polish notation should
be interpreted as "not $o(k^n)$"; observe, for instance, that all strings produced by our
grammar for a unary alphabet have odd reverse polish length.

Remark 7.2. Using the singularity analysis method explained in Section 6, these lower
bounds were obtained by boot-strapping off the trivial bounds of $\Omega(k^n)$, $\Omega(k^{n/2})$ and
$\Omega(k^n)$ for the ordinary, reverse polish length and alphabetic width cases, respectively.

Before we generalize our approach to cover also infinite languages, we derive a for- 
mula showing how our lower bound on alphabetic width will increase along with the 
alphabet size k. 

To this end, we first state a version of the Lagrange implicit function theorem as a
simplification of [14, Theorem 1.2.4]. If $f(x)$ is a power series in $x$, we write $[x^n]f(x)$
to denote the coefficient of $x^n$ in $f(x)$; recall that the characteristic of a ring $R$ with
additive identity 0 and multiplicative identity 1 is defined to be the smallest positive
integer $k$ such that $\sum_{i=1}^{k} 1 = 0$, or zero if there is no such $k$.

Lemma 7.3. Let $R$ be a commutative ring of characteristic zero and take $\phi(\lambda) \in R[[\lambda]]$
such that $[\lambda^0]\phi$ is invertible. Then there exists a unique formal power series
$w(x) \in R[[x]]$ such that $[x^0]w = 0$ and $w = x\phi(w)$. For $n \ge 1$,

$[x^n]w(x) = \frac{1}{n}[\lambda^{n-1}]\phi^n(\lambda).$

Due to the simplicity of alphabetic width, the problem of enumerating regular languages
in this case may be interpreted as doing so for rooted $k$-ary trees, where each
internal node is marked with one of two possible colours. We thus investigate how our
lower bound varies with $k$.

More specifically, consider a regular expression $r$ generated by the grammar from the
previous section and its associated trie. Colour each node with a child labelled ε black and
all other nodes white. After deleting all nodes marked ε, call the resultant tree $T(r)$. This
operation is reversible, and shows that we may put the expressions of alphabetic width $n$
in correspondence with the $k$-ary rooted trees with $n + 1$ vertices where every non-root
internal node may assume one of two colours. In order to estimate the latter, we first prove
a basic result. The first half of the following lemma is also found in [12, p. 68].

Lemma 7.4. There are $\frac{1}{n}\binom{kn}{n-1}$ $k$-ary trees of $n$ nodes. Moreover, the expected
number of leaf nodes among $k$-ary trees of $n$ nodes is asymptotic to $(1 - 1/k)^k n$ as $n \to \infty$.

Proof. Fix $k > 1$. For $n \ge 1$, let $a_n$ denote the number of $k$-ary rooted trees with $n$
vertices and consider the generating series:

$f(x) = \sum_n a_n x^n .$

By the recursive structure of $k$-ary trees, we have the recurrence:

$f(x) = x(1 + f(x))^k .$

Thus, by the Lagrange implicit function theorem, we have

$a_n = [x^n]f(x) = \frac{1}{n}[\lambda^{n-1}](1 + \lambda)^{kn} = \frac{1}{n}\binom{kn}{n-1} .$

We now calculate the number of leaf nodes among all $k$-ary rooted trees with $n$ vertices.
Let $b_{n,m}$ denote the number of $k$-ary rooted trees with $n$ vertices and $m$ leaf nodes and
$c_n$ the number of leaf nodes among all $k$-ary rooted trees with $n$ vertices. Consider the
bivariate generating series:

$g(x, y) = \sum_{n,m} b_{n,m} x^m y^n .$

By the recursive structure of $k$-ary trees, we have the recurrence:

$g(x, y) = y\left(x - 1 + (1 + g(x, y))^k\right) .$






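The Lagrange implicit function theorem once again yields

$c_n = \frac{\partial}{\partial x}\,[y^n]\,g(x, y)\Big|_{x=1}$
$\quad = \frac{\partial}{\partial x}\,\frac{1}{n}[\lambda^{n-1}]\left(x - 1 + (1 + \lambda)^k\right)^n\Big|_{x=1}$
$\quad = \frac{1}{n}[\lambda^{n-1}]\,\frac{\partial}{\partial x}\left(x - 1 + (1 + \lambda)^k\right)^n\Big|_{x=1}$
$\quad = [\lambda^{n-1}](1 + \lambda)^{k(n-1)}$
$\quad = \binom{k(n-1)}{n-1} .$

Thus, the expected number of leaf nodes among $n$-node trees is, as $n \to \infty$ while having
$k$ fixed,

$\frac{c_n}{a_n} = \frac{n\binom{k(n-1)}{n-1}}{\binom{kn}{n-1}} \sim \left(\frac{k-1}{k}\right)^{k} n .$

□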
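Both closed forms in Lemma 7.4 can be checked against the functional equations for small $n$. The sketch below works with truncated power series stored as coefficient lists; it iterates $f = x(1+f)^k$ to obtain the tree counts, and obtains the total leaf counts from the equation $c = y + k\,y\,(1+f)^{k-1}c$, which follows by differentiating the recurrence for $g$ with respect to $x$ and setting $x = 1$. That derived equation and all function names are part of this illustration rather than of the proof above.

```python
from math import comb


def mul(p, q, n):
    """Product of two power series truncated after degree n."""
    r = [0] * (n + 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                if i + j > n:
                    break
                r[i + j] += a * b
    return r


def power(p, e, n):
    r = [1] + [0] * n
    for _ in range(e):
        r = mul(r, p, n)
    return r


def tree_counts(k, n):
    """a_0..a_n with a_m = number of k-ary rooted trees on m vertices."""
    f = [0] * (n + 1)
    for _ in range(n):                      # each pass fixes one more coefficient
        f = [0] + power([1] + f[1:], k, n)[:n]
    return f


def leaf_totals(k, n):
    """c_0..c_n with c_m = total number of leaves over all k-ary trees on m vertices."""
    f = tree_counts(k, n)
    w = power([1] + f[1:], k - 1, n)        # (1 + f)^(k-1)
    c = [0] * (n + 1)
    for _ in range(n):
        kc = [k * t for t in mul(w, c, n)]
        c = [0] + [(1 if i == 0 else 0) + kc[i] for i in range(n)]
    return c
```

For $k = 2$ the first list gives the Catalan numbers 1, 2, 5, 14, 42, and the second gives 1, 2, 6, 20, matching $\binom{2(n-1)}{n-1}$.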



We wish to find a bound on the expected number of subsets of non-root internal nodes
among all $k$-ary rooted trees with $n$ nodes, where a subset corresponds to those nodes
marked black. Fix $k \ge 2$. Since the map $x \mapsto 2^x$ is convex, for every $\varepsilon > 0$ and
sufficiently large $n$, Jensen's inequality (e.g., [47, Thm. 3.3]) applied to the lemma above
implies the following lower bound on the number of subsets:

$2^{(1 - (1 - 1/k)^k - \varepsilon)n} .$

Since $-(1 - 1/k)^k > -1/e$ for $k \ge 1$, we may choose $\varepsilon > 0$ such that

$-(1 - 1/k)^k - \varepsilon > -1/e .$

This yields a lower bound of $2^{(1 - 1/e)n}$.

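Assuming $k \ge 2$ fixed, we now estimate $\binom{kn}{n-1}$. By Stirling's formula, we have, as
$n \to \infty$,

$\binom{kn}{n-1} \in \Theta\left(\frac{1}{\sqrt{n}}\left(\frac{k^k}{(k-1)^{k-1}}\right)^{n}\right) .$

Putting our two bounds together, we have the following lower bound on the number of
star-free regular expressions of alphabetic width $n$, when $n \to \infty$ while keeping $k$ fixed:

$\Omega\left(\left(2^{(1 - 1/e)} \cdot \frac{k^k}{(k-1)^{k-1}}\right)^{n}\right) .$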
7.2 Trie representations for some infinite regular languages

We now turn our attention to enumerating regular languages in general; that is, we allow
for regular expressions with Kleene stars.

Our grammars for this section are based on those for the star-free cases. Due to the
difficulty of avoiding specifying duplicate regular languages, we settle for a "small"
subset of regular languages. For simplicity, we only consider taking the Kleene star closure
of singleton alphabet symbols, and we impose some further restrictions.

Recall the trie representation of a star-free regular expression written in our common
prefix notation. With this representation, we may mark nodes with stars while satisfying
the following conditions:

• each starred symbol must have a non-starred parent other than the root; 

• a starred symbol may not have a sibling or an identically-labelled parent (disregard- 
ing the lack of star) with its own sibling; and 

• a starred symbol may not have an identically-labelled child (disregarding the lack 
of star). 

The first condition eliminates duplicates such as

0*11*0*1*0* ↔ 0*1*0*11*0*;

the second eliminates those such as

01* ↔ 0(ε + 11*) and 0(1 + 2*1) ↔ 02*1

and the third eliminates those such as

0*0 ↔ 00* .

In this manner, we end up with starred tries such as in Fig. 1(b). Algorithm 1 illustrates
how to recreate such a starred trie from the language it specifies.
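The duplicate pairs listed above can be sanity-checked by brute force. The sketch below uses Python's `re` module and compares two patterns on every string up to a fixed length; this is evidence of equivalence on short strings only, not a proof, and the helper name `same_language` is hypothetical. Union is written `|` and ε is written as an empty alternative.

```python
import re
from itertools import product


def same_language(p, q, alphabet, max_len=7):
    """True if patterns p and q accept exactly the same strings of length <= max_len."""
    cp, cq = re.compile(p), re.compile(q)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if bool(cp.fullmatch(w)) != bool(cq.fullmatch(w)):
                return False
    return True


pairs = [
    ("0*11*0*1*0*", "0*1*0*11*0*", "01"),   # ruled out by the first condition
    ("01*", "0(|11*)", "01"),               # ruled out by the second condition
    ("0(1|2*1)", "02*1", "012"),            # ruled out by the second condition
    ("0*0", "00*", "01"),                   # ruled out by the third condition
]
results = [same_language(p, q, sigma) for p, q, sigma in pairs]
```

Each pair agrees on all strings up to length 7, which is consistent with the claim that the three conditions remove genuinely redundant expressions.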



Algorithm 1 STAR-TRIE(L)

Require: $\varepsilon \notin L$, $L \ne \emptyset$
 1: create a tree T with unlabelled root
 2: for all $a \in \Sigma$ such that $a^{-1}L \ne \emptyset$ do
 3:     append STAR-TRIE-HELP($a^{-1}L$, a) below the root of T
 4: end for
 5: return T



Let T be any starred trie satisfying the conditions above. Then T represents a regu- 
lar expression, which in turn specifies a certain language. We now show that when the 
algorithm is run with that language as input, it returns the trie T by arguing that at each 
step of the algorithm when a particular node (matched with language L if the root and ah 
otherwise) is being processed, the children are correctly reconstructed. 

We first consider children of the root. By the original trie construction (for finite
languages without ε), no such children may be labelled ε. Thus, by the first star condition,
the only children may be unstarred alphabet symbols. Thus, line 2 of Algorithm 1 suffices
to find all children of the root correctly.
Algorithm 2 STAR-TRIE-HELP(L, a)

 1: create a tree T with root labelled a
 2: for all $b \in \Sigma$ s.t. $b^{-1}L \ne \emptyset$ do
 3:     if $(b^{-n}L) \cap (\varepsilon + (\Sigma \setminus \{b\})\Sigma^*) \ne \emptyset$ for all $n \ge 0$ then {need a child labelled b*}
 4:         append a new b*-node below the root of T
 5:         if $L \ne b^*$ then {b* will be an internal node}
 6:             for all $c \in \Sigma \setminus \{b\}$ such that $c^{-1}L \ne \emptyset$ do {determine children of b*}
 7:                 append STAR-TRIE-HELP($c^{-1}L$, c) below the b*-node
 8:             end for
 9:             if $b \in L$ then
10:                 append a new ε-node below the b*-node
11:             end if
12:         end if
13:     else {need a child labelled b}
14:         append STAR-TRIE-HELP($b^{-1}L$, b) below the root of T
15:     end if
16: end for
17: if $\varepsilon \in L$ and the root of T has at least one unstarred child then
18:     append a new ε-node below the root of T
19: end if
20: return T



Now consider a non-root internal node, say labelled a. By the third star condition, a
starred node may not have a child labelled with the same alphabet symbol, so if a has a
child labelled b*, then

$(b^{-n}L) \cap (\varepsilon + (\Sigma \setminus \{b\})\Sigma^*)$ is non-empty for all $n \ge 0$.    (7.1)

Conversely, by the second condition, a starred node may not have an identically-labelled
parent that has ε as a sibling, so if (7.1) holds, then a must have a child labelled b*.
By the second star condition, a starred node may not have siblings, so the algorithm
need not check for other children once a starred child is found. This shows that line 3
of Algorithm 2 correctly identifies all starred children of a. Assuming a has a starred
child b*, then by the third condition, line 6 of Algorithm 2 correctly recovers all children
of b*. All remaining children of a have no stars, and line 14 of Algorithm 2 suffices to
find all children labelled with $b \in \Sigma$; the special case of an ε-child below a is covered by
line 17.

We give a grammar that generates expressions meeting these conditions in Table 3.
As before, we take our alphabet to be $\Sigma = \{a_0, a_1, \ldots, a_{k-1}\}$. We describe the roles of
the non-terminals of the grammar in Table 3.

S generates all expressions; this corresponds to Algorithm 1.

E, E_i generate expressions that may be concatenated to non-starred and starred alphabet
symbols, respectively. The non-terminal E corresponds to lines 2 and 13 while
E_i corresponds to line 5 of Algorithm 2. These act the same as S except for the
introduction of parentheses to take precedence into account and the restriction that no
prefixes of the form ε + aa* are generated, used to implement the second condition.
    S    →  Y | Z
    E    →  Y | (Z) | (ε + Y') | (ε + Z)
    E_i  →  Y_i | (Z_i) | (ε + Y'_i) | (ε + Z_i)            for 0 ≤ i < k
    Y    →  P_i                                             for 0 ≤ i < k
    Y'   →  P'_i                                            for 0 ≤ i < k
    Y_i  →  P_j                                             for 0 ≤ i, j < k and i ≠ j
    Y'_i →  P'_j                                            for 0 ≤ i, j < k and i ≠ j
    Z    →  P'_{n_0} + P'_{n_1} + ··· + P'_{n_t}            where 0 ≤ n_0 < n_1 < ··· < n_t < k for t > 0
    Z_i  →  P'_{n_0} + P'_{n_1} + ··· + P'_{n_t}            as above, but with n_j ≠ i for all 0 ≤ j ≤ t
    P_i  →  a_i | a_i E | a_i a_j* | a_i a_j* E_j           for 0 ≤ i, j < k
    P'_i →  a_i | a_i E | a_i a_j* | a_i a_j* E_j           for 0 ≤ i, j < k and i ≠ j

Table 3. A grammar generating all regular expressions meeting all three star conditions.



Additionally, E_i has the restriction that its first alphabet symbol produced may not
be a_i; this is used to implement the third condition.

Y, Y', Y_i, Y'_i generate expressions whose prefix is an alphabet symbol. As a whole, these
non-terminals correspond to Algorithm 2 and may be considered degenerate cases
of Z and Z_i; that is, trivial unions.
The tick-mark signifies that expressions of the form aa* for $a \in \Sigma$ are disallowed,
used to implement the second condition. The subscripted i signifies that the initial
alphabet symbol may not be a_i, used to implement the third condition.

Z, Z_i generate non-trivial unions of expressions beginning with distinct alphabet symbols;
Z corresponds to line 2 of Algorithm 1 and line 13 of Algorithm 2 while Z_i
corresponds to line 5 of Algorithm 2.
The subscripted i signifies that none of the initial alphabet symbols may be a_i, used to
implement the third condition.

P_i, P'_i generate expressions beginning with the specified alphabet symbol a_i. They
correspond to line 1 of Algorithm 2.
The tick-mark signifies that expressions may not have the prefix a_i a_i*, used to
implement the second condition.

Since the algorithm correctly returns a trie when run on the language represented by 
the trie, the correspondence between the algorithm and the grammar gives us the following 
result. 

Theorem 7.5. The grammar above is unambiguous and the generated regular expressions 
represent distinct regular languages. 
    k    ordinary        reverse polish    alphabetic
    1    Ω(1.3247^n)     Ω(1.2720^n)       Ω(2^n)
    2    Ω(2.7799^n)     Ω(2.2140^n)       Ω(7.4140^n)
    3    Ω(3.9582^n)     Ω(2.8065^n)       Ω(12.5367^n)
    4    Ω(5.0629^n)     Ω(3.2860^n)       Ω(17.6695^n)
    5    Ω(6.1319^n)     Ω(3.6998^n)       Ω(22.8082^n)
    6    Ω(7.1804^n)     Ω(4.0693^n)       Ω(27.9500^n)

Table 4. Improved lower bounds for R_k(n) with respect to size measure and alphabet size.

Table 4 lists the improved lower bounds for R_k(n). These lower bounds were obtained
via singularity analysis, as explained in Section 6, boot-strapping off the bounds in
Table 2.

8 Upper bounds on enumeration of regular languages by regular expressions

Turning our attention back to upper bounds for $R_k(n)$, we develop grammars for regular
expressions such that every regular language is represented by at least one shortest regular
expression generated by the grammar, where a regular expression r of size n is said to be
shortest if there is no expression r' of size less than n with L(r) = L(r').

To this end, we consider certain "normal forms" for regular expressions, with the
property that transforming a regular expression into normal form never increases its size.
Again, size may refer to one of the various measures introduced before. With such a
normal form, it suffices to enumerate all regular expressions in normal form to obtain
improved upper bounds on $R_k(n)$ for various measures.

8.1 A grammar based on normalized regular expressions 

We begin with a simple approach, which will be further refined later on. As concatenation
and sum are associative, we consider them to be variadic operators taking at least 2
arguments and impose the condition that in any parse tree, neither of them is permitted to
have itself as a child. Also, by the commutativity of the sum operator, we impose
the condition that the summands of each sum appear in the following order: first come
all summands which are terminal symbols, then all summands which are concatenations,
and finally all starred summands. Also, we can safely omit all subexpressions of the form
s**, s* + ε, (s + ε)*, s + ε + ε: occurrences of these can be replaced with occurrences
of s*, s*, s*, and s + ε, respectively. Here the latter subexpressions have size no larger
than the former ones, and this holds for all size measures considered.

¹ The Maple worksheets used to derive these bounds can be accessed at the second author's personal
homepage via http://math.stanford.edu/~jlee/automata/

    S   →  Q | A | T | C | K
    Q   →  A + ε | T + ε | C + ε
    A   →  T + A_T | C + A_C | K + A_K
    A_T →  T | T + A_T | A_C
    A_C →  C | C + A_C | A_K
    A_K →  K | K + A_K
    T   →  a_1 | a_2 | ··· | a_k
    C   →  C_0 C_0 | C_0 C
    C_0 →  (Q) | (A) | T | K
    K   →  (A)* | T* | (C)*

Table 5. A simple unambiguous grammar for generating at least one shortest regular
expression for each regular language.

    C   →  (Q) C_Q | (A) C_A | T C_T | K C_K
    C_Q →  (Q) | (Q) C_Q | C_A
    C_A →  (A) | (A) C_A | C_T
    C_T →  T | T C_T | C_K
    C_K →  K | K C_K

Table 6. Rules for concatenation over unary alphabets, which in that case is commutative.

These observations immediately lend themselves to a simple unambiguous grammar, such as
the one listed in Table 5. The meaning of the variables is as follows:

S generates all regular expressions obeying the abovementioned format. Among
them,

Q generates those expressions of the form r + ε,
A generates those of the form r + s, i.e. "additions",
T generates those which are terminal symbols,
C generates those of the form rs, i.e. concatenations,
C_0 generates the "factors" appearing inside concatenations (which are themselves not
concatenations), and
K generates those of the form r*, i.e. Kleene stars;

finally, the "summands" in expressions of type A are subdivided into subtypes A_T, A_C
and A_K, used for handling summands which are terminal symbols, concatenations, or
Kleene stars, respectively.

In the special case of unary alphabets, not only union, but also concatenation (again
viewed as a variadic operator) is commutative. In this case, we may impose a similar
ordering of factors as done for summands, and thus we can replace the rule with C as
left-hand side with the rules given in Table 6.
8.2 A grammar based on strong star normal form

We now refine the above approach by considering only regular expressions in strong star 
normal form JT3], a notion that we recall in the following. 

Since ∅ is only needed to denote the empty set, and the need for ε can be substituted
by the operator $L^? = L \cup \{\varepsilon\}$, an alternative syntax introduces also the ?-operator and
instead forbids the use of ∅ and ε inside non-atomic expressions. The definition of strong
star normal form is most conveniently given for this alternative syntax.

Definition 8.1. The operators $\circ$ and $\bullet$ are defined on regular expressions. The first
operator is given by: $a^{\circ} = a$, for $a \in \Sigma$; $(r + s)^{\circ} = r^{\circ} + s^{\circ}$; $r^{?\circ} = r^{\circ}$;
$r^{*\circ} = r^{\circ}$; finally, $(rs)^{\circ} = rs$ if $\varepsilon \notin L(rs)$, and $r^{\circ} + s^{\circ}$ otherwise. The second
operator is given by: $a^{\bullet} = a$, for $a \in \Sigma$; $(r + s)^{\bullet} = r^{\bullet} + s^{\bullet}$;
$(rs)^{\bullet} = r^{\bullet}s^{\bullet}$; $r^{*\bullet} = r^{\bullet\circ *}$; finally, $r^{?\bullet} = r^{\bullet}$ if $\varepsilon \in L(r)$,
and $r^{?\bullet} = r^{\bullet ?}$ otherwise. The strong star normal form of an expression $r$ is then
defined as $r^{\bullet}$.
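The two operators translate almost literally into a recursive program. The sketch below is an illustration under an assumed encoding, not the authors' implementation: expressions in the alternative syntax are nested tuples with tags `"sym"`, `"alt"`, `"cat"`, `"star"` and `"opt"` (for the ?-operator), and the function names `nullable`, `circ` and `bullet` are chosen here.

```python
def nullable(r):
    """True iff the empty word belongs to L(r)."""
    tag = r[0]
    if tag == "sym":
        return False
    if tag in ("star", "opt"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])       # "cat"


def circ(r):
    """The first operator of Definition 8.1."""
    tag = r[0]
    if tag == "sym":
        return r
    if tag == "alt":
        return ("alt", circ(r[1]), circ(r[2]))
    if tag in ("star", "opt"):
        return circ(r[1])
    if not nullable(r):                             # "cat" without the empty word
        return r
    return ("alt", circ(r[1]), circ(r[2]))


def bullet(r):
    """The second operator; bullet(r) is the strong star normal form of r."""
    tag = r[0]
    if tag == "sym":
        return r
    if tag == "alt":
        return ("alt", bullet(r[1]), bullet(r[2]))
    if tag == "cat":
        return ("cat", bullet(r[1]), bullet(r[2]))
    if tag == "star":
        return ("star", circ(bullet(r[1])))
    inner = bullet(r[1])                            # "opt"
    return inner if nullable(r[1]) else ("opt", inner)


a, b = ("sym", "a"), ("sym", "b")
normal = bullet(("star", ("cat", ("star", b), ("star", a))))
```

Applied to (b*a*)*, the sketch returns the encoding of (b + a)*, in line with the observation that these two expressions denote the same language.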

An easy induction shows that the transformation into strong star normal form pre- 
serves the described language, and that it is weakly monotone with respect to all usual 
size measures. We sketch a proof for the case of ordinary length. 

Lemma 8.1. Let $r$ be a regular expression without occurrences of the symbol ∅, and let
$r^{\bullet}$ be its strong star normal form. Then $\mathrm{ord}(r^{\bullet}) \le \mathrm{ord}(r)$.

Proof Sketch. First of all, we may safely assume that $r$ does not contain any subexpressions
ruled out by the grammar of the previous section, such as ε + ε; the transformation
into strong star normal form subsumes these reductions anyway.

Recall the definition of the auxiliary operator $\circ$ in the definition of strong star normal
form (Definition 8.1). The proof relies on the following claim: If $\varepsilon \in L(r)$ and
$L(r) \ne \{\varepsilon\}$, then $\mathrm{ord}(r^{\circ}) \le \mathrm{ord}(r) - 1$; otherwise, $\mathrm{ord}(r^{\circ}) \le \mathrm{ord}(r)$. This claim
can be proved by induction while excluding the cases $L(r) = \emptyset, \{\varepsilon\}$. The base cases are
easy; the induction step is most interesting in the case $r = st$. If $\varepsilon \notin L(st)$, then
$r^{\circ} = st$ and the claim holds; otherwise $r^{\circ} = s^{\circ} + t^{\circ}$ with $\varepsilon \in L(s)$ and
$\varepsilon \in L(t)$. We can apply the induction hypothesis twice to deduce
$\mathrm{ord}(s^{\circ}) + \mathrm{ord}(t^{\circ}) \le \mathrm{ord}(s) + \mathrm{ord}(t) - 2$, and thus
$\mathrm{ord}(s^{\circ} + t^{\circ}) \le \mathrm{ord}(st) - 1$, as desired. Notice that, as union has lower precedence
than concatenation, this step never introduces new parentheses. The induction step in the
other cases is even easier. □

Since every regular language is represented by at least one shortest regular expression
in strong star normal form (with respect to all three considered size measures), it suffices
to enumerate those expressions in normal form. Our improved grammar will be based on
the following simple observation on expressions in strong star normal form:

Lemma 8.2. If $s^*$ or $s + \varepsilon$ appears as a subexpression of an expression in star normal
form, then $\varepsilon \notin L(s)$. □

To exploit this fact, for each subexpression we need to keep track of whether it denotes 
the empty word. This can of course be done with dynamic programming, by using rules 
5+












S+ -> 


Q + 


\A+\C+\K+ 


s- 




A- 


\T-\C- 








A~ 


+ e | T~ +e | C- +e 














A+ -> 


T~ 


+ A+ | C~ + A+ | 
















A~ 


+ A+|C+ + A+| 


A~ 


-> 


T" 


+ A T | C" - 


f A c 




At -> 


K+ 

C+ 


+ A+ 

1 C+ + At 1 At 


A;; 


— >• 
— >. 


T~ 
C~ 


| T- + A T | 
\ C~ + AT, 






A+^ 


K+ 


i#++a+ 
















T~ 


-> 


ai 


a,2 | • • • | a* 






C+-> 




c + 1 c+c+ 


c*- 


-> 


Cr7 
Cr7 


Co C(l 
C~ | Cq C+ 


|C + 
|C + 


c " 1 
c- 


C+^ 


(Q 4 


) 1 1 K+ 




-> 


(A- 


-)|T- 






K+ -> 




-)*\T-*\ (C - )* 















Table 7. A better unambiguous grammar generating at least one shortest regular 
expression (in strong star normal form) for each regular language. 



such as $\varepsilon \in L(rs)$ iff $\varepsilon \in L(r)$ and $\varepsilon \in L(s)$. Since in addition every
subexpression either denotes the empty word or not, it is easy to extend the above grammar
to incorporate these rules while retaining the property of being unambiguous.

Notice that most variables now come in an ε-flavor (for example, the variable $A^+$) and
in an ε-free flavor (for example, the variable $A^-$). Moreover, the summands inside sums
appear in the following order, which is a refinement of the summand ordering devised
previously: first come all summands which are terminal symbols, then all summands
which are ε-free concatenations, then all concatenations with ε in the denoted language,
and finally all starred summands. To illustrate this ordering, we give the most important
steps of the unique derivation for the expression
$a_1 + a_2a_3 + (a_4 + \varepsilon)(a_5 + \varepsilon) + a_6^*$:



S A- +A+=>T- + A T +A+=> ai +A T + A+ 
=► at + A c + A+ => at + C~ + A+ a x + a 2 a 3 + A+ 
=> at + a 2 a 3 + C + + Aj ==>* at + a 2 a 3 + (a 4 + e)(a 5 + e) + Ag 
=> at + a 2 a 3 + (04 + e)(a 5 + e) + A\ =>■ a x + a 2 a 3 + (a 4 + e)(a 5 + e) + K + 
=^>* 01 + a 2 a 3 + (a 4 + e)(ag + e) + a 6 * 

The following proposition, giving the correctness of the improved grammar, can be
proved by induction on the minimum required regular expression size. Table 8 lists the
upper bounds obtained through this grammar.²

² The Maple worksheets used to derive these bounds can be accessed at the second author's personal
homepage via http://math.stanford.edu/~jlee/automata/

Proposition 8.3. The grammar in Table 7 is unambiguous and, for each regular language,
generates at least one regular expression of minimal ordinary length (respectively:
reverse polish length, alphabetic width) representing it. □



    k    ordinary        reverse polish    alphabetic
    1    O(2.5946^n)     O(2.7422^n)       O(k^n · 21.5908^n)
    2    O(4.2877^n)     O(3.9870^n)       O(k^n · 21.5908^n)
    3    O(5.4659^n)     O(4.7229^n)       O(k^n · 21.5908^n)
    4    O(6.5918^n)     O(5.3384^n)       O(k^n · 21.5908^n)
    5    O(7.6870^n)     O(5.8780^n)       O(k^n · 21.5908^n)
    6    O(8.7624^n)     O(6.3643^n)       O(k^n · 21.5908^n)

Table 8. Summary of upper bounds on R_k(n) for k = 1, 2, ..., 6 and various size
measures. For ordinary length, we used the simple grammar in Table 5 because
the computation for the improved grammar ran out of computational resources. For
reverse polish length, we used the simple grammar for bootstrapping the bounds.



    k    ordinary        reverse polish    alphabetic
    1    O(2.1793^n)     O(2.0795^n)       O(10.9822^n)
    2    O(3.8145^n)     O(3.3494^n)       O(k^n · 12.2253^n)
    3    O(4.9019^n)     O(4.0315^n)       O(k^n · 12.2253^n)
    4    O(5.8234^n)     O(4.6121^n)       O(k^n · 12.2253^n)
    5    O(6.8933^n)     O(5.1268^n)       O(k^n · 12.2253^n)
    6    O(7.9492^n)     O(5.5939^n)       O(k^n · 12.2253^n)

Table 9. Summary of upper bounds for k = 1, 2, ..., 6 and various size measures
in the case of finite languages. For reverse polish length, we bootstrapped from the
values in Table 8; for ordinary length, we bootstrapped the case k = 2 from the
upper bound obtained for k = 3.



9 Exact enumerations 

Tables 10 to 13 give exact numbers for the number of regular languages representable by
a regular expression of size n, but not by any of size less than n.

We explain how these numbers were obtained.³ Using the upper bound grammars
described previously, a dynamic programming approach was taken to produce (in order
of increasing regular expression size) the regular expressions generated by each
non-terminal. To account for duplicates, each regular expression was transformed into a DFA,
minimized and relabelled via a breadth-first search to produce a canonical representation.
Using these representations as hashes, any regular expression matching a previous one
generated by the same non-terminal was simply ignored.

³ The C++ source code of the software used to compute these numbers can be accessed at the second
author's personal homepage via http://math.stanford.edu/~jlee/automata/
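The relabelling step can be illustrated as follows. This is a sketch of the general idea only, not the authors' C++ code: it assumes the DFA is already minimal and complete, is given as a transition dictionary, and the name `canonical_form` is hypothetical. States are renumbered in breadth-first order with symbols visited in sorted order, so two isomorphic minimal DFAs receive the same hashable signature.

```python
from collections import deque


def canonical_form(start, delta, finals, alphabet):
    """Canonical signature of a complete DFA.

    delta maps (state, symbol) to a state and must be defined for every
    reachable state and every symbol of the alphabet.
    """
    symbols = sorted(alphabet)
    index = {start: 0}                      # breadth-first numbering
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for a in symbols:
            t = delta[(q, a)]
            if t not in index:
                index[t] = len(index)
                queue.append(t)
    order = sorted(index, key=index.get)
    table = tuple(tuple(index[delta[(q, a)]] for a in symbols) for q in order)
    accepting = tuple(q in finals for q in order)
    return table, accepting


# Two differently named DFAs for "even number of a's" over {a, b}.
d1 = {("p", "a"): "q", ("p", "b"): "p", ("q", "a"): "p", ("q", "b"): "q"}
d2 = {(7, "a"): 3, (7, "b"): 7, (3, "a"): 7, (3, "b"): 3}
sig1 = canonical_form("p", d1, {"p"}, "ab")
sig2 = canonical_form(7, d2, {7}, "ab")
seen = {sig1}
is_duplicate = sig2 in seen
```

Because the signature is a tuple, it can be stored in a set or used as a dictionary key, which is how duplicates would be filtered in this sketch.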



    n     k = 1    k = 2    k = 3     k = 4
    1     3        4        5         6
    2     1        4        9         16
    3     2        11       33        74
    4     3        28       117       336
    5     3        63       391       1474
    6     5        156      1350      6560
    7     5        358      4546      28861
    8     8        888      15753     128720
    9     9        2194     55053     578033
    10    14       5665     196185    2624460

Table 10. Ordinary length, finite languages

    n     k = 1    k = 2    k = 3     k = 4
    1     3        4        5         6
    2     2        6        12        20
    3     3        17       48        102
    4     4        48       192       520
    5     5        134      760       2628
    6     9        397      3090      13482
    7     12       1151     12442     68747
    8     17       3442     51044     354500
    9     25       10527    211812    1840433
    10    33       32731    891228

Table 11. Ordinary length, general case



 n \ k       1          2          3          4
   1         3          4          5          6
   3         2          7         15         26
   5         3         25         85        202
   7         5        109        589       1917
   9         9        514       4512      20251
  11        14       2641      37477     231152
  13        24      14354     328718    2780936
  15        41      81325    2998039
  17        71     475936
  19       118    2854145

Table 12. Reverse polish length, finite languages 


 n \ k       1          2          3          4
   1         3          4          5          6
   2         1          2          3          4
   3         2          7         15         26
   4         2         13         33         62
   5         3         32        106        244
   6         4         90        361        920
   7         6        189       1012       3133
   8         7        580       3859      13529
   9        11       1347      11655      48388
  10        15       3978      43431     208634

Table 13. Reverse polish length, general case 
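
The smallest entries of Table 12 can be checked by brute force. The sketch below is not the method described above: instead of grammars and canonical DFAs, it stores each finite language directly as a set of strings and combines languages by union and concatenation. The function name `new_languages_by_size` and its parameters are assumptions made for illustration. It relies on the observation that only languages of minimal size need to be combined: if an operand had a shorter expression, the result would too, and so would not be new at size n.

```python
def new_languages_by_size(alphabet, max_size):
    """Count finite languages by minimal reverse polish length.

    Returns {n: number of languages whose shortest expression has size n}.
    Only union and concatenation are used (no star), so every expression
    denotes a finite language, stored here as a frozenset of strings.
    """
    # Size 1: the empty set, the empty word, and each single letter.
    atoms = {frozenset(), frozenset([""])}
    atoms |= {frozenset([a]) for a in alphabet}
    by_size = {1: atoms}     # minimal size -> languages first seen there
    seen = set(atoms)
    for n in range(2, max_size + 1):
        fresh = set()
        # A binary operator costs 1; its operands have sizes i and n - 1 - i.
        for i in range(1, n - 1):
            for x in by_size.get(i, ()):
                for y in by_size.get(n - 1 - i, ()):
                    union = x | y
                    concat = frozenset(u + v for u in x for v in y)
                    for lang in (union, concat):
                        if lang not in seen:
                            fresh.add(lang)
        seen |= fresh
        by_size[n] = fresh
    return {n: len(langs) for n, langs in by_size.items() if langs}
```

For the alphabet `"ab"` and `max_size = 5` this returns 4, 7 and 25 for sizes 1, 3 and 5, matching the k = 2 column of Table 12. The approach is only practical for very small sizes, since the number of stored languages grows exponentially.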



 n \ k       1          2          3          4
   0         2          2          2          2
   1         2          4          6          8
   2         4         24         60        112
   3         8        182        806       2164
   4        16       1652      13182      51008
   5        32      16854     242070    1346924
   6        64     186114    4785115

Table 14. Alphabetic width, finite languages 


 n \ k       1          2          3          4
   0         2          2          2          2
   1         3          6          9         12
   2         6         56        150        288
   3        14        612       3232       9312
   4        30       7923      82614     357911
   5        72     114554    2332374
   6       155    1768133

Table 15. Alphabetic width, general case 



10 Conclusion and open problems 

In this chapter, we discussed various approaches to enumerating regular expressions and 
the languages they represent, and we used algebraic and analytic tools to compute upper 
and lower bounds for these enumerations. Our upper and lower bounds are not always 
very close, so an obvious open problem (or class of open problems) is to improve these 
bounds. Other problems we did not examine here involve enumerating interesting 
subclasses of regular expressions. For example, in linear expressions, every alphabet symbol 
occurs exactly once. In addition to the intrinsic interest, enumerating subclasses may 
provide a strategy for improving the lower bounds for the general case. 

Abstract. In this chapter we discuss the problem of enumerating distinct regular expressions by 
size and the regular languages they represent. We discuss various notions of the size of a regular 
expression that appear in the literature and their advantages and disadvantages. We consider a 
formal definition of regular expressions using a context-free grammar. 

We then show how to enumerate strings generated by an unambiguous context-free grammar 
using the Chomsky-Schützenberger theorem. This theorem allows one to construct an algebraic 
equation whose power series expansion provides the enumeration. Classical tools from complex 
analysis, such as singularity analysis, can then be used to determine the asymptotic behavior of the 
enumeration. 

We use these algebraic and analytic methods to obtain asymptotic estimates on the number of 
regular expressions of size n. A single regular language can often be described by several regular 
expressions, and we estimate the number of distinct languages denoted by regular expressions of 
size n. We also give asymptotic estimates for these quantities. For the first few values, we provide 
exact enumeration results. 



